Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

نویسندگان

  • Hans-Michael Müller
  • Eimear E Kenny
  • Paul W Sternberg
چکیده

We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Textpresso - an Information Retrieval and Extraction System for Biological Literature

We developed an information retrieval and extraction system that processes the full text of biological papers. The system, called Textpresso, separates text into sentences, labels words and phrases according to an ontology (an organized lexicon), and allows queries to be performed on a database of labeled sentences. The current ontology comprises approximately one hundred categories of terms, s...

متن کامل

Textpresso text mining:

Manual curation of experimental data from the biomedical literature is expensive and time-consuming; however, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. We have developed and actively use a category-based information retrieval and extraction system for curating C. elegans proteins to the Gene Ontology's Cellular Component Ontology. The s...

متن کامل

Adaptation of an Open Source Semantic and Conceptual Retrieval Framework to the Astrobiological Domain

Introduction: Astrobiology is by nature a system-level science, meaning that it is concerned with complex , multidisciplinary, multi-phenomena behaviors of large physical and biological systems. Due to the breadth of the undertakings in astrobiological inquiry, researchers in the field must rely heavily on information technology to consolidate and represent knowledge and data from across many d...

متن کامل

Public Transport Ontology for Passenger Information Retrieval

Passenger information aims at improving the user-friendliness of public transport systems while influencing passenger route choices to satisfy transit user’s travel requirements. The integration of transit information from multiple agencies is a major challenge in implementation of multi-modal passenger information systems. The problem of information sharing is further compounded by the multi-l...

متن کامل

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره 2  شماره 

صفحات  -

تاریخ انتشار 2004